Abstract
#Background Sequencing technology is widely used not only in life sciences research but also in disease diagnosis and treatment. Growing demand has driven the development of numerous sequencing methods beyond traditional techniques. However, because sequencing data constitutes personal information, its misuse raises concerns about privacy breaches. Genomic analysis can identify the mutations carried by an individual, which makes it possible to infer that individual's identity. Moreover, because genetic mutations are transmitted across generations, privacy risks extend beyond the individual to larger groups. To mitigate these risks, platforms for sharing sequencing data (e.g., EGA, TCGA, dbGaP) provide only limited access to the data. To address the risk of personal identification and privacy breaches during data sharing, we developed a software tool called Varaser. Varaser takes BAM and VCF files from the user, removes the mutations listed in the VCF, and restores the affected sequences to the reference sequence. By running the software just before the data sharing stage, users can selectively remove only the mutations they choose. Thus, mutations that could enable personal identification are removed while the information needed for research remains intact, ensuring the transparency of the data.
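As a rough illustration of what restoring a read to the reference sequence can mean for a single-nucleotide variant, the following minimal sketch replaces an alternative base with the reference base. It is not Varaser's implementation: the function name `revert_snvs`, the 0-based coordinates, the `variants` dictionary layout, and the assumption that the read aligns without indels or clipping are all hypothetical simplifications.

```python
def revert_snvs(read_start, seq, variants):
    """Return seq with listed alternative bases replaced by the reference base.

    read_start: 0-based reference position of the first read base (assumed).
    seq: read bases; the read is assumed to align without indels or clipping.
    variants: dict mapping 0-based position -> (ref_base, alt_base) (assumed layout).
    """
    bases = list(seq)
    for offset, base in enumerate(bases):
        hit = variants.get(read_start + offset)
        # Only revert when the read actually carries the alternative allele.
        if hit is not None and base == hit[1]:
            bases[offset] = hit[0]
    return "".join(bases)


# Hypothetical toy read covering positions 100-104, with a G>A reversion at 102.
reverted = revert_snvs(100, "ACGTT", {102: ("A", "G")})
```

A real tool would also have to handle indels, clipped reads, and base qualities, which this sketch deliberately leaves out.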
#Method For the mutation erasure test, we analyzed sequencing data from 22 AML patients, each paired with their matched whole exome sequencing (WES) germline data. To simulate mutation removal, we randomly selected between 1,000 and 10,000 germline variants from each patient in increments of 1,000, repeating each step 10 times to ensure robustness. Following the removal of these germline variants, we called somatic mutations to assess the performance of our software. Additionally, we performed gene expression analysis using RNA-seq data after mutation removal. This analysis confirmed that the removal of mutations had minimal effects on overall gene expression levels and on the results of differential gene expression (DEG) analysis.
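The sampling design described above (10 set sizes, each repeated 10 times per patient) could be sketched as follows. This is an illustration only: the function name `sampling_plan`, the fixed seed, and the choice to sample without replacement within each set are assumptions, not details taken from the study.

```python
import random


def sampling_plan(variant_ids, sizes=range(1000, 10001, 1000), repeats=10, seed=0):
    """Draw `repeats` random variant sets for each set size.

    variant_ids: list of germline variant identifiers for one patient
                 (must contain at least max(sizes) entries).
    Returns a dict mapping (size, repeat_index) -> list of sampled identifiers.
    """
    rng = random.Random(seed)  # fixed seed (assumed) so the plan is reproducible
    plan = {}
    for size in sizes:
        for rep in range(repeats):
            # random.sample draws without replacement within one set.
            plan[(size, rep)] = rng.sample(variant_ids, size)
    return plan


# Hypothetical patient with 20,000 germline variants labeled 0..19999.
plan = sampling_plan(list(range(20000)))
```

Under these assumptions each patient yields 100 variant sets, one per combination of set size and repeat.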
#Results In all steps, the randomly selected mutations were successfully erased and were not detected during mutation calling, confirming that the designated mutations were reliably converted to the reference sequence. This was further validated using IGV: the alternative alleles observed before erasure were no longer present after erasure, and only the reference sequence was shown. Expression analysis revealed that gene expression levels remained unchanged despite mutation removal (Pearson correlation, p < 0.05). Additionally, in differential gene expression (DEG) analysis, there was minimal change in log2 fold change values, and the set of DEGs identified before and after mutation removal remained consistent.
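A before-and-after comparison of expression levels of the kind reported above can be sketched with a plain Pearson correlation coefficient. The function `pearson_r` and the toy expression values are hypothetical and are not data from the study. Note that it is a coefficient close to 1, rather than the p-value alone, that indicates agreement between the two profiles.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical per-gene expression values before and after mutation removal.
before = [5.1, 0.3, 12.8, 7.4, 2.2]
after = [5.0, 0.3, 12.9, 7.4, 2.1]
r = pearson_r(before, after)
```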
#Conclusion Varaser provides a practical and effective approach to enhancing privacy in human sequencing data by selectively removing user-specified germline mutations while preserving critical somatic variant information. Its ability to maintain data integrity across both DNA and RNA sequencing modalities, alongside consistent mutation calling results, underscores its robustness and versatility. While certain limitations—such as incomplete handling of structural variants and potential read-mapping differences—require careful consideration, Varaser remains a valuable tool for secure data sharing in cancer genomics and precision medicine. By balancing anonymization with research utility, Varaser contributes meaningfully to the growing need for privacy-preserving strategies in large-scale genomic data dissemination.